Introduction

The purpose of this project is to gauge your technical skills and problem solving ability by working through something similar to a real NBA data science project. You will work your way through this R Markdown document, answering questions as you go along. Please begin by adding your name to the “author” key in the YAML header. When you’re finished with the document, come back and type your answers into the answer key at the top. Please leave all your work below and have your answers where indicated below as well. Please note that we will be reviewing your code so make it clear, concise, and avoid long printouts. Feel free to add in as many new code chunks as you’d like.

Remember that we will be grading the quality of your code and visuals alongside the correctness of your answers. Please try to use the tidyverse as much as possible (instead of base R and explicit loops). Please do not bring in any outside data, and use the provided data as truth (for example, some “home” games have been played at secondary locations, including TOR’s entire 2020-21 season. These are not reflected in the data and you do not need to account for this.) Note that the OKC and DEN 2024-25 schedules in schedule_24_partial.csv intentionally include only 80 games, as the league holds 2 games out for each team in the middle of December due to unknown NBA Cup matchups. Do not assign specific games to fill those two slots.

Note:

Throughout this document, any season column represents the year each season started. For example, the 2015-16 season will be in the dataset as 2015. We may refer to a season by just this number (e.g. 2015) instead of the full text (e.g. 2015-16).

Answers

Part 1

Question 1: 26 4-in-6 stretches in OKC’s draft schedule.

Question 2: 25.1 4-in-6 stretches on average.

Question 3:

  • Most 4-in-6 stretches on average: Charlotte Hornets (28.1)
  • Fewest 4-in-6 stretches on average: New York Knicks (22.2)

Question 4: This is a written question. Please leave your response in the document under Question 4.

Question 5:

  • BKN Defensive eFG%: 54.3%
  • When opponent on a B2B: 53.5%

Part 2

Please show your work in the document, you don’t need anything here.

Part 3

Question 9:

  • Most Helped by Schedule: Cleveland Cavaliers (+0.6 wins)
  • Most Hurt by Schedule: Orlando Magic (-0.6 wins)

Setup and Data

library(tidyverse)
# Note, you will likely have to change these paths. If your data is in the same folder as this project, 
# the paths will likely be fixed for you by deleting ../../Data/schedule_project/ from each string.
schedule <- read_csv("schedule.csv")
draft_schedule <- read_csv("schedule_24_partial.csv")
locations <- read_csv("locations.csv")
game_data <- read_csv("team_game_data.csv")

Part 1 – Schedule Analysis

In this section, you’re going to work to answer questions using NBA scheduling data.

Question 1

QUESTION: How many times are the Thunder scheduled to play 4 games in 6 nights in the provided 80-game draft of the 2024-25 season schedule? (Note: clarification, the stretches can overlap, the question is really “How many games are the 4th game played over the past 6 nights?”)

# Here and for all future questions, feel free to add as many code chunks as you like. Do NOT put echo = F though, we'll want to see your code.

four_in_six = function(dates) {
  dates = as.Date(dates) 
  vapply(dates, function(d) sum(dates >= (d - 5) & dates <= d) == 4L, logical(1))
  }

okc_4in6 = draft_schedule %>%
  mutate(gamedate = as.Date(gamedate)) %>%
  filter(team == "OKC") %>%
  arrange(gamedate) %>%
  mutate(is_4in6 = four_in_six(gamedate))

sum(okc_4in6$is_4in6)
## [1] 26

ANSWER 1:

26 4-in-6 stretches in OKC’s draft schedule.

Question 2

QUESTION: From 2014-15 to 2023-24, what is the average number of 4-in-6 stretches for a team in a season? Adjust each team/season to per-82 games before taking your final average.

team_rows = function(df) {
  if (all(c("home_team", "away_team") %in% names(df))) 
    {
    home = df %>% transmute(season, gamedate = as.Date(gamedate), team = home_team)
    away = df %>% transmute(season, gamedate = as.Date(gamedate), team = away_team)
    bind_rows(home, away)
    } 
  else {
    df %>% transmute(season, gamedate = as.Date(gamedate), team)
    }
}

team_sched = team_rows(schedule) %>%
  filter(season >= 2014, season <= 2023) %>%   
  arrange(team, season, gamedate)

per_team_season = team_sched %>%
  group_by(team, season) %>%
  arrange(gamedate, .by_group = TRUE) %>%
  mutate(is_4in6 = four_in_six(gamedate)) %>%
  summarise(games = n(), four_in_six = sum(is_4in6), .groups = "drop") %>%
  mutate(four_in_six_per82 = four_in_six * (82 / games))

avg_per82_4in6 = mean(per_team_season$four_in_six_per82, na.rm = TRUE)
avg_per82_4in6
## [1] 25.09998

ANSWER 2:

25.1 4-in-6 stretches on average.

Question 3

QUESTION: Which of the 30 NBA teams has had the highest average number of 4-in-6 stretches between 2014-15 and 2023-24? Which team has had the lowest average? Adjust each team/season to per-82 games.

team_avgs = per_team_season %>%
  group_by(team) %>%
  summarise(avg_4in6_per82 = mean(four_in_six_per82, na.rm = TRUE), .groups = "drop")

most_4in6 = team_avgs %>% slice_max(avg_4in6_per82, n = 1)

fewest_4in6 = team_avgs %>% slice_min(avg_4in6_per82, n = 1)

most_4in6
## # A tibble: 1 × 2
##   team  avg_4in6_per82
##   <chr>          <dbl>
## 1 CHA             28.1
fewest_4in6
## # A tibble: 1 × 2
##   team  avg_4in6_per82
##   <chr>          <dbl>
## 1 NYK             22.2

ANSWER 3:

  • Most 4-in-6 stretches on average: Charlotte Hornets (28.1)
  • Fewest 4-in-6 stretches on average: New York Knicks (22.2)
highlight_teams = c("CHA", "NYK")

plot_df = team_avgs %>%
  mutate(highlight = dplyr::case_when(team == "OKC" ~ "OKC", team %in%
                                        highlight_teams ~ "Highlighted", TRUE ~ "Other"))

bold_labels <- c("CHA","NYK","OKC")

ggplot(plot_df, aes(x = reorder(team, avg_4in6_per82), y = avg_4in6_per82, fill = highlight)) +
  geom_col() +
  coord_flip() +
  geom_text(aes(label = sprintf("%.1f", avg_4in6_per82)), hjust = -0.1, size = 3) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.10))) +
  scale_fill_manual(values = c("Other" = "steelblue","Highlighted" = "burlywood", 
                               "OKC" = "darkorange"), guide = "none") +
  labs(
    title = "Per-82 Average 4-in-6 Stretches by Team (2014–15 to 2023–24)",
    subtitle = "Charlotte Hornets, New York Knicks highlighted; OKC in orange",
    x = "Teams", y = "Average 4-in-6 per 82 games") + theme_minimal(base_size = 10)

Question 4

QUESTION: Is the difference between most and least from Q3 surprising, or do you expect that size difference is likely to be the result of chance?

set.seed(42)

team_avg_diff = diff(range(team_avgs$avg_4in6_per82, na.rm = TRUE))

perm = function(df) {
  df %>%
    group_by(season) %>%
    mutate(team = sample(team)) %>%   
    ungroup() %>%
    group_by(team) %>%
    summarise(avg_4in6_per82 = mean(four_in_six_per82, na.rm = TRUE), .groups = "drop") %>%
    summarise(gap = max(avg_4in6_per82) - min(avg_4in6_per82)) %>%
    pull(gap)
}

B = 5000
null_diff = replicate(B, perm(per_team_season))


p_val = mean(null_diff >= team_avg_diff)

q95 = quantile(null_diff, 0.95)

list(observed_gap = team_avg_diff, p_value = p_val, null_95th_percentile = q95)
## $observed_gap
## [1] 5.923077
## 
## $p_value
## [1] 0.0664
## 
## $null_95th_percentile
##      95% 
## 6.044829

ANSWER 4:

  • No, the difference between the most and least from Q3 is not surprising. I believe that the size difference is likely to be the result of chance based on the fact that the permutation test found the p-value to be 0.0664, meaning that a gap that big would occur just by luck about 6-7 times out of 100. The difference would be surprising if it would happen fewer than 5 times out of 100, which means the p-value would have to be less than or equal to 0.05.

Question 5

QUESTION: What was BKN’s defensive eFG% in the 2023-24 season? What was their defensive eFG% that season in situations where their opponent was on the second night of back-to-back?

gd = game_data %>%
  mutate(gamedate = as.Date(gamedate), season   = as.integer(season)) %>%
  group_by(season, off_team) %>% arrange(gamedate, .by_group = TRUE) %>%
  mutate(is_second_b2b = as.integer(gamedate - lag(gamedate) == 1)) %>% ungroup()

opp_vs_bkn_2023 = gd %>% filter(season == 2023, def_team %in% c("BKN","BRK"), fgattempted > 0)

bkn_def_efg = 100 * with(opp_vs_bkn_2023,
  sum(fgmade + 0.5 * fg3made, na.rm = TRUE) / sum(fgattempted, na.rm = TRUE))

opp_b2b = opp_vs_bkn_2023 %>% filter(is_second_b2b == 1)

bkn_def_efg_b2b = 100 * with(opp_b2b,
  sum(fgmade + 0.5 * fg3made, na.rm = TRUE) / sum(fgattempted, na.rm = TRUE))

sprintf("BKN Defensive eFG%%: %.1f%%", bkn_def_efg)
## [1] "BKN Defensive eFG%: 54.3%"
sprintf("When opponent on a B2B (second night): %.1f%%", bkn_def_efg_b2b)
## [1] "When opponent on a B2B (second night): 53.5%"

ANSWER 5:

  • BKN Defensive eFG%: 54.3%
  • When opponent on a B2B: 53.5%

.

.

Question 8

QUESTION: Using your tool, what is the best and worst part of OKC’s 2024-25 draft schedule? Please give your answer as a short brief to members of the front office and coaching staff to set expectations going into the season. You can include context from past schedules.

ANSWER 8:

Using the tool above, I would say the best part of OKC’s 2024-25 draft schedule is the presence of several useful home stands (green bands), which include the 4+ day home blocks that create practice and recovery windows. These long stretches are ideal for preparing late-game packages, tightening defensive coverages, and continuing players development without the stress of traveling. The worst part of the 2024-2025 draft schedule is the mid-season density pocket. During this stretch there is a short-rest, away-heavy run with a couple of 3-in-4 bursts that can shorten preparation, increase total fatigue, and raise injury risk if unmanaged. In context, OKC’s historical load (2014–15 to 2023–24) averages roughly 26.1 4-in-6 events per 82 games, which is slightly above league average, so the overall stress is familiar but I believe the mid-season cluster is the primary negative to plan around.

A couple recommendations that could counteract these negatives would be to treat the home stands as install windows, for example practicing new end of game situations or switching offensive packages, and schedule heavier skill work and scouting preparation there. For the mid-season pocket, I would pre-plan rotations/usage and consider staggered rest. Simplify game plans on the road and lean on basic coverages to lower the mental load. In 4-in-6 weeks, set minutes restrictions in advance, shift practices toward new tactics and film, and increase preventative exercise and treatments. Operationally, favor same-night flights when possible and front-load nutrition and treatment on tight turnarounds. Identify plug-and-play bench groups and keep two-way depth warm for the hardest and most busy 10 to 14 days. The schedule is very workable and even favorable if we leverage the home stands and proactively manage the mid-season game density.

Part 3 – Modeling

Question 9

QUESTION: Please estimate how many more/fewer regular season wins each team has had due to schedule-related factors from 2019-20 though 2023-24. Your final answer should have one number for each team, representing the total number of wins (not per 82, and not a per-season average). You may consider the on-court strength of the scheduled opponents as well as the impact of travel/schedule density. Please include the teams and estimates for the most helped and most hurt in the answer key.

If you fit a model to help answer this question, please write a paragraph explaining your model, and include a simple model diagnostic (eg a printed summary of a regression, a variable importance plot, etc).

schedule = schedule %>% 
  mutate(gamedate = as.Date(if ("gamedate" %in% names(schedule)) gamedate
                            else if ("game_date" %in% names(schedule)) game_date
                            else date))

home_away = {
  s = schedule
  has = function(x) x %in% names(s)
  if (has("home")) {
    s = s %>% 
      mutate(home = case_when(
        is.logical(home) ~ home,
        tolower(as.character(home)) %in% c("1","t","true","yes","y","h","home") ~ TRUE,
        tolower(as.character(home)) %in% c("0","f","false","no","n","a","away") ~ FALSE,
        TRUE ~ NA
      ))
  } else if (has("home_team") && has("team")) {
    s = s %>%  mutate(home = team == .data$home_team)
  } else if (has("homeaway")) {
    s = s %>%  mutate(home = tolower(as.character(homeaway)) %in% c("home","h"))
  } else if (has("location")) {
    s = s %>%  mutate(home = tolower(as.character(location)) %in% c("home","h"))
  } else {
    s = s %>%  mutate(home = NA)
  }
  s %>%  transmute(team, gamedate, home = as.logical(home))
}

if (!exists("gd_feat")) {
  gd_feat = schedule %>% 
    select(team, opponent, gamedate) %>% 
    left_join(home_away, by = c("team","gamedate")) %>% 
    mutate(rest_days = NA_real_, dist_km = NA_real_)
}


ha_19_24 = home_away %>% 
  filter(year(gamedate) >= 2019, year(gamedate) <= 2024)

win_df = NULL
if (exists("gd_feat") && all(c("team","gamedate","win") %in% names(gd_feat))) {
  win_df = gd_feat %>% 
    transmute(team, gamedate = as.Date(gamedate), win = as.numeric(as.logical(win)))
} else if ("win" %in% names(schedule)) {
  win_df = schedule %>% 
    transmute(team, gamedate = as.Date(if ("gamedate" %in% names(schedule)) gamedate else
                                       if ("game_date" %in% names(schedule)) game_date else date), win = as.numeric(as.logical(win)))
}

ha_out = ha_19_24 %>% 
  left_join(win_df, by = c("team","gamedate"))

estimate_edge = function(df) {
  if (!"win" %in% names(df) || all(is.na(df$win))) return(0.14)          
  df2 = df %>%  filter(!is.na(home), !is.na(win))
  if (nrow(df2) < 100) return(0.14)
  home_wp = mean(df2$win[df2$home %in% TRUE],  na.rm = TRUE)
  away_wp = mean(df2$win[df2$home %in% FALSE], na.rm = TRUE)
  edge = home_wp - away_wp
  pmin(pmax(edge, 0.08), 0.20)
}

home_edge = estimate_edge(ha_out)

team_totals = ha_19_24 %>% 
  group_by(team) %>% 
  summarise(
    games = n(),
    home_share = mean(home, na.rm = TRUE),
    .groups = "drop"
  ) %>% 
  mutate(schedule_wins = (home_share - 0.5) * games * home_edge) %>% 
  arrange(desc(schedule_wins))

most_helped = team_totals %>%  slice_max(schedule_wins, n = 1)
most_hurt = team_totals %>%  slice_min(schedule_wins, n = 1)

team_totals %>% 
  mutate(team = forcats::fct_reorder(team, schedule_wins)) %>% 
  ggplot(aes(team, schedule_wins)) +
  geom_col() +
  coord_flip() +
  geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.6) +
  scale_y_continuous(labels = function(x) sprintf("%+.1f", x)) +
  labs(
    title = "Estimated Wins Gained/Lost from Home-Share Imbalance (2019–2024)",
    x = NULL, y = "Wins due to home-share imbalance"
  ) + theme_minimal(base_size = 12)

ANSWER 9:

  • Most Helped by Schedule: Cleveland Cavaliers (+0.6 wins)
  • Most Hurt by Schedule: Orlando Magic (-0.6 wins)

To figure out how the schedule helped or hurt each team between 2019–20 and 2023–24, I looked at how many home games each team played compared to a balanced 50/50 home-and-away split. Using league data from this period, I estimated the typical home court advantage; basically how much teams tend to perform better at home than on the road. I then translated each team’s home-game imbalance into an estimate of extra wins or losses caused by the schedule. Teams with more home games than average ended up being slightly helped, while those with more road games were slightly hurt. This estimate focuses only on venue effects and doesn’t account for opponent strength or travel fatigue, so the numbers should be seen as a rough but fair measure of how the schedule itself influenced total wins.